Abstract
Introduction: Myelodysplastic syndromes (MDS) are a heterogeneous group of hematologic neoplasms that are often asymptomatic in early stages, with initial suspicion arising from subtle complete blood count (CBC) abnormalities. The CBC is one of the most commonly performed laboratory tests worldwide. The aim of this study was to develop and pilot-implement a machine learning (ML) algorithm to identify patients at high risk of undiagnosed MDS using only CBC data. The project was conducted between January 2024 and December 2024 and was supported by Bristol Myers Squibb Poland.
Methods: We developed a one-class classification model (the Maximum Sum of Evidence algorithm) using an anonymized electronic health record (EHR) dataset comprising 32,850 patients. This project was approved by the Rzeszow University Ethical Commission (ethical assessment No. 2024/05/025). The dataset included 4,393 confirmed MDS cases (identified by ICD-10 codes and verified by the Saventic Medical Team) and 28,457 non-MDS controls from 14 medical centers, matched for age (18-100 years) and sex. For MDS patients, data from 6 months before to 2 days after diagnosis were used, with a comparable observation window for controls. The model utilized age, sex, and 14 CBC parameters (RBC, HGB, HCT, MCV, MCH, MCHC, RDW, WBC, NEUT, LYMP, MONO, EO, BASO, PLT). Patients with MDS diagnoses or with conditions that mimic MDS on CBC (e.g., other hematologic malignancies, alcoholism, liver cirrhosis, chronic kidney disease, hemodialysis, ongoing chemo/radiotherapy) were excluded based on ICD-10 codes. Data were aggregated (first value) and discretized (20 quantile-based bins). The Kolmogorov-Smirnov test was applied to select discriminative features. The model was trained using an 80/20 train-test split, with 100-fold stratified cross-validation on the training set. On the test set, the model achieved a recall of 77.74%, specificity of 93.08%, precision of 62.81%, F1 score of 69.48%, AUPRC of 74.90%, and AUROC of 94.66%. Neutrophils, monocytes, mean corpuscular volume, platelet count, and hematocrit emerged as the 5 most informative features.
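The preprocessing steps named above (first-value aggregation, quantile-based discretization, and the Kolmogorov-Smirnov comparison) can be illustrated with a minimal standard-library sketch. This is not the study's implementation: the record layout, the function names, and the reading of "first value" as the chronologically earliest measurement are assumptions made for illustration, and the Maximum Sum of Evidence classifier itself is not reproduced because the abstract does not describe it.

```python
from bisect import bisect_right
from statistics import quantiles


def first_value_per_patient(records):
    """Keep the earliest value of each parameter for each patient.

    `records` holds (patient_id, day, parameter, value) tuples. This layout
    is hypothetical; the abstract does not describe the EHR schema.
    """
    first = {}
    for patient_id, _day, parameter, value in sorted(records, key=lambda r: r[1]):
        # setdefault stores only the first (earliest) value seen per key.
        first.setdefault((patient_id, parameter), value)
    return first


def quantile_bin_edges(values, n_bins=20):
    """Return the n_bins - 1 interior cut points of quantile-based bins."""
    return quantiles(values, n=n_bins, method="inclusive")


def assign_bin(value, edges):
    """Map a value to a bin index between 0 and len(edges), inclusive."""
    return bisect_right(edges, value)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic (largest gap between the ECDFs)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

A larger statistic for a given parameter means its distribution differs more between MDS cases and controls. How the study turned the test result into a selection rule (for example, a p-value threshold) is not stated, so no cutoff is applied here.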
Results: The algorithm was deployed in 4 Polish medical centers, analyzing 2,475,706 anonymized patient EHRs over ten monthly screening cycles. The algorithm flagged 95 patients as being at high risk of MDS. Following de-anonymization, expert clinical assessment at the medical centers, and diagnostic workup triggered by the algorithm's prediction, 8 patients received a new diagnosis of MDS as a direct outcome of the screening. Additionally, 11 flagged patients were already undergoing diagnostic evaluation, and 8 had a prior MDS diagnosis that was not recorded in their uploaded EHRs. MDS was excluded in 40 patients, 6 remained under diagnostic evaluation, and 22 were excluded due to lack of consent or contact. The precision achieved by the model was 40.29%. Interestingly, no statistically significant differences in CBC parameters were observed between confirmed and excluded MDS cases, highlighting the importance of contextual clinical data.
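The outcome counts can be checked with a few lines of arithmetic. The abstract does not state how the precision was computed. The sketch below assumes that new diagnoses, patients already under evaluation, and patients with a prior unrecorded MDS diagnosis count as true positives, that patients in whom MDS was excluded count as false positives, and that the remaining two groups are left out. Under that assumption the result is 27/67, which agrees with the reported 40.29% apart from rounding of the final digit.

```python
# Outcome counts for the 95 flagged patients, as reported in the Results.
outcomes = {
    "new_mds_diagnosis": 8,
    "already_under_evaluation": 11,
    "prior_mds_diagnosis_missing_from_ehr": 8,
    "mds_excluded": 40,
    "still_under_evaluation": 6,
    "no_consent_or_contact": 22,
}

# The six groups account for every flagged patient.
total_flagged = sum(outcomes.values())

# ASSUMPTION (not stated in the source): the first three groups are treated
# as true positives, "mds_excluded" as false positives, and the last two
# groups are left out of the denominator.
true_positives = (
    outcomes["new_mds_diagnosis"]
    + outcomes["already_under_evaluation"]
    + outcomes["prior_mds_diagnosis_missing_from_ehr"]
)
false_positives = outcomes["mds_excluded"]
precision = true_positives / (true_positives + false_positives)

print(f"flagged patients: {total_flagged}")
print(f"precision under the stated assumption: {precision:.4%}")
```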
Conclusions: Our multi-center study demonstrates the practical applicability of an ML algorithm for identifying patients at high risk of undiagnosed MDS based solely on CBC data. The system directly contributed to new MDS diagnoses and identified a substantial group of additional patients who were already undergoing diagnostic evaluation, although this was not apparent from the analyzed records. This highlights the system's strong potential to detect individuals early in their diagnostic journey. The delayed feedback on high-risk status for some patients was primarily due to infrequent hospital data updates (only every 2–3 months), which significantly limited the system's ability to deliver timely alerts. Although analysis of a single initial CBC proves valuable, the limited differentiation in CBC parameters between confirmed and excluded MDS cases emphasizes the need for comprehensive clinical data. To further optimize performance, we are currently expanding the exclusion criteria to minimize false positives and developing a time-series model that uses longitudinal CBC data. The project has now been scaled up to 8 medical centers, with monthly EHR data updates. The long-term objective is to enable real-time data integration to support physicians with timely, AI-driven diagnostic alerts.